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l> : ABSTRACT 

, Monte Carlo methods can provide accurate p-value estimates of word counting 

test statistics and are easy to implement. They are especially attractive when 
■ an asymptotic theory is absent or when either the search sequence or the word 

pattern is too short for the application of asymptotic formulae. Naive direct 
^ \ Monte Carlo is undesirable for the estimation of small probabilities because the 

^ , associated rare events of interest are seldom generated. We propose instead 

efficient importance sampling algorithms that use controlled insertion of the de- 
sired word patterns on randomly generated sequences. The implementation is 
illustrated on word patterns of biological interest: Palindromes and inverted re- 
peats, patterns arising from position specific weight matrices and co-occurrences 
of pairs of motifs. 

Key words: co-occurrences of motifs, importance sampling, Monte Carlo, motifs, palin- 
dromes, position specific weight matrices, p-values, transcription factor binding sites. 
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1. INTRODUCTION 



Searching for matches to a word pattern, also called a motif, is an important task in com- 
putational biology. The word pattern represents a functional site, such as a transcription 
factor binding site (TFBS) in a promoter region of a DNA sequence or a ligand docking site 
in a protein sequence. Statistical significance of over-representation of these word patterns 
provides valuable clues to biologists and as a result, there have been a lot of work done on 
the use of asymptotic limiting distributions to approximate these p-values, see Prum et al. 
(1995), Reinert et al. (2000), Rcgnicr (2000), Robin et al. (2002), Huang et al. (2004), Leung 
et al. (2005), Mitrophanov and Borodovsky (2006), Pape et al. (2008) and references therein. 
However, the approximations may not be accurate for short words or for words consisting 
of repeats and most theoretical approximations work only in specific settings. String-based 
recursive methods can provide exact p- values, see for example Gusfield (1997), but they can 
be computationally expensive when the number of words in the word pattern is large. 

Direct Monte Carlo algorithms for estimating p-values of word patterns are easy to 
implement but are inefficient for the estimation of very small p-values because in such cases, 
almost all the simulated sequences do not contain the required number of word patterns. We 
propose in this paper importance sampling algorithms that insert the desired word patterns 
either randomly or controlled by a hidden Markov model, on the simulated sequences. The 
algorithms are described in Section 2 and arc illustrated on several word patterns of biological 
interest: Palindromes and inverted repeats in Section 3, high-scoring words with respect to 
position specific weight matrices in Section 4 and co-occurrences of motifs in Section 5. 
Numerical results show that variance reduction of several orders of magnitude are achieved 
when applying the proposed importance sampling algorithms on small p-values. The technical 
details are consolidated in the appendices and include a proof of the asymptotic optimality 
of the importance sampling algorithms, in Appendix D. 
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2. IMPORTANCE SAMPLING OF WORD PATTERNS 

2.1 Word counting 

Let \B\ denote the number of elements in a set B. By selecting randomly from a finite 
set we shall mean that each b £ B has probability \B\^^ of being selected. For any two 
sequences v = vi - ■ -Vm and u = ui - ■ - Ur, the notation vu shall denote the concatenated 
sequence vi ■ ■ ■ VmUi ■ ■ ■ Ur- We also denote the length of v by ^(v)(= m). Although we 
assume implicitly an alphabet X = {a, c, g, t}, representing the four nucleotide bases of DNA 
sequences, the algorithms can be applied on any countable alphabet, for example the alphabet 
of 20 amino acids in protein sequences. 

We will represent the word pattern of interest by a set of words V and assume that 
jV| < oo. Let s = si ■ ■ ■ Sn denote a sequence of DNA bases under investigation and let 
be the maximum number of non-overlapping words from V in = si • • • s^- We say that 
there exists a word in V at the end of if Sm~j+i ••• Sm £ V for some j > 0. Moreover, the 
smallest such j is the length of the shortest word at the end of s^- We have the recursive 
relations, for m > 1, 



with the initialization Nq = 0. We denote A^^^ simply by A^. It is also possible to modify (|2.ip 
to handle the counting of possibly overlapping words. 

2. 2 Monte Carlo evaluation of statistical significance 

We begin by describing direct Monte Carlo. To evaluate the signifiance of observing c 
word patterns in an observed sequence s, we generate independent copies of the sequence 
from a Markov chain with transition probabilities estimated either from s or from a local 
neighborhood of s. The proportion of times {N > c} occurs among the independent copies 
of s is then the direct Monte Carlo estimate of the p-value pc '■= P{N > c}. 

It is quite common for many sequences to be analyzed simultaneously. Hence to correct 
for the effect of multiple comparisons, a very small p-value is required for any one sequence 
before statistical significance can be concluded. Direct Monte Carlo is well-known to be 
very inefficient for estimating small probabilities in general and many importance sampling 
schemes have been proposed to overcome this drawback, for example in sequential analysis 
(Siegmund, 1976), communication systems (Cottrell, Fort and Malgouyres, 1983), bootstrap- 
ping (Johns, 1988 and Do and Hall, 1992), signal detection (Lai and Shan, 1999), moderate 



(2.1) 




Nm-j + 1 if the shortest word in V at the end of is of length j. 



if there is no word in V at the end of s. 
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deviations (Fuh and Hu, 2004) and scan statistics (Chan and Zhang, 2007). In this paper, we 
provide change of measures that are effective for the importance samphng of word patterns. 

For ease of exposition, assume that the background sequence of bases follows a first-order 
Markov chain with positive transition probabilities 

(2.2) a{xy) := P{si+i = y\si = x}, x,y e X. 

Let vr be the stationary distribution and let (t(vi • ■ -Vi) = 0}=! ^(^i'^'j+i)- Before executing 
the importance sampling algorithms, we first create a word bank of the desired word pattern, 
with each word in the word bank taking the value v € V with probability g(v) > 0. The 
procedure for the selection of q and construction of the word banks will be elaborated in 
Sections 3-5. For completeness, we define q{v) = when v ^ V. Let /3(v) = q{v)/a{v). For 
case of computation, we shall generate a dummy variable sq before generating s and denote 
So • • • Sn by sq. The first importance sampling algorithm, for the estimation of pi only, is as 
follows. 

ALGORITHM A (for c = 1). 

1. Select a word v randomly from the word bank. Hence the word takes the value v G V 
with probability (/(v). 

2. Select io randomly from {1, . . . , n — £{\) + 1}. 

3. Generate sq from the stationary distribution and si, . . ., Si^-i sequentially from the 
underlying Markov chain. Let s^g • • • SjQ_|_^(v)-i = v and generate SjQ_|_£(v)) • • •Sn se- 
quentially from the underlying Markov chain. 

Let imin = minvgv^(v) and imax = maxvev^(v). Recall that (3{v) = for v ^ V. Then 

■^max 

(2.3) L{so):= J2 {ri-e + 1)-^ ^ P{si ■ ■ ■ Si+e^i)/a{si^iSi) 

is the likelihood ratio of generating Sq from Algorithm A and from the underlying Markov 
chain (with no insertion of word patterns). If Algorithm A is run independently K times, 

(k) 

with the kth copy of Sq generated denoted by Sq > then 

K 

(2.4) pi:=i^-i^L-i(s;,'V{ivW>c} 

fe=i 

is unbiased for pc- The term l{jv('^)>c} superfluous when using Algorithm A since at least 
one word pattern from V is generated in every copy of Sq. 
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We restrict Algorithm A to c = 1 because the random insertion of more than one word 
patterns into the simulated sequence can result in a hard to compute likelihood ratio. To 
handle more general c, we use a hidden Markov model device in Algorithm B below, with 
hidden states Xi taking either value (do not insert word pattern) or 1 (insert word pattern) , 
so that the likelihood ratio can be computed recursively. Let 

(2.5) p, = P{Xi = l\so---Si} 

be the word insertion probability at position i + 1 along the DNA sequence. For example, the 
user can simply select pi = c/n for all i so that approximately c word patterns are inserted 
in each generated sequence sq. Each copy of sq is generated in the following manner. 

ALGORITHM B (for c > 1). 

1. Let i = 0, generate sq from the stationary distribution and Xq satisfying (j2.5p . 

2. (a) If Xi = 1, select a word v randomly from the word bank. If £{v) < n — i, that is, 

if the word v can fit into the remaining sequence, let Sj+i • ■ ■ Sj_(_^(v) = v, generate 
^i+e{v) according to (|2.5|) . increment i by ^(v) and go to step 3. 

(b) If the word selected in 2(a) cannot fit into the remaining sequence or if Xi = 
0, generate Sj+i from the underlying Markov chain and Xj+i satisfying (j2.5p . 
Increment « by 1 and go to step 3. 

3. If i < n, repeat step 2. Otherwise, end the recursion. 

Let Li = Li{sQ ■ ■ ■ Si) be the likelihood ratio of generating sq - ■ ■ Si from Algorithm B 
and from the underlying Markov chain. Let 7j = X]v6V-^(v)<j 'i'(^) probability that a 

randomly chosen word from the word bank has length not exceeding j. Then 

■^max 

(2.6) Li = {I - pi_i-in-i+i)Li-i + ^ pi-iLi_il3{si-t+i- ■ ■ Si)/a{si-iSi-i+i) \l i>l, 
with Lj = for i <0. 

The estimator (|2.4p . with L = Ln, is unbiased if and only if all configurations of sq 
satisfying N > c can be generated via Algorithm B. To ensure this, it suffices for us to 
impose the constraint 

(2.7) Pi <l for alH < n - ^mm(c - Ni), 
so that we do not force the insertion of too many word patterns. 
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3. PALINDROMIC PATTERNS AND INVERTED REPEATS 



Masse et al. (1992) reported clusters of palindromic patterns near origin of replications 
of viruses. There have been much work done to estimate their significance, for example using 
Poisson and compound Poisson approximations, see Leung et al. (1994, 2005). The four 
nucleotides can be divided into two complementary base pairs with a and t forming a pair 
and c and g forming the second pair. We denote this relation by writing a'^ = t, = a, 
= g and g'^ = c. For a word = ui - ■ ■ Um, we define its complement = • • • . A 
palindromic pattern of length £ = 2m is a DNA sequence that can be expressed in the form 
Umujjj. For example, v = acgcgt is a palindromic pattern. Note that the complement of v, 
that is the word obtained by replacing each letter of v by its complement, is tgcgca, which 
is just V read backwards. This interesting property explains the terminology "palindromic 
pattern" . 

Inverted repeats can be derived from palindromic patterns by inserting a DNA sequence 
of length d in the exact middle of the pattern. The class of word patterns for inverted repeats 
can be expressed in the form 

(3.1) V = {u,„z<, : di < liz) < d2}, 

with < di < d2. When di = d2 = 0, then (j3.ip is the class of all palindromic patterns of 
length 2m. 

The construction of word banks for palindromic patterns is straightforward. It all boils 
down to generating in some suitable manner. We advocate generating with probabil- 
ity proportional to 7r(ui)(T(Um)(7(u^) or Tr{ui)a{umW^) and show how this can be done in 
Appendix A. 

Having a word bank for palindromic patterns allows us to create a word bank for inverted 
repeats easily. The procedure is as follows. 

1. Select Um^m randomly from a word bank of palindromic patterns and d randomly from 
{di, . . .,d2}. 

2. Let zq = Um and generate zi, . . . ,Zd sequentially from the underlying Markov chain. 

3. Store the word u^z^uj^ into the word bank for inverted repeats. 

This procedure allows jj, see ()2.6p . to be computed easily. In particular, 7_,- = (j — di + 
l)/(f^2 — di + 1) for di < j < d2, 7j = for j < di and 7j = 1 for j > d2- 
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4. POSITION SPECIFIC WEIGHT MATRIX (PSWM) 



PSWMs are commonly used to derive fixed-length word patterns or motifs that tran- 
scription factors bind onto and usually range from four to twenty bases long.. Databases such 
as TRANSFAC, JASPAR and SCPD curate PSWMs for families of transcription factors. For 
example, the PSWM for the SWI5 transcription factor in the yeast genome is 



(4.1) 



v 
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2 
3 



4 1 

1 1 
2 

2 3 



2 A 

7 

2 7 7 5 

5 7 y 



see Zhu and Zhang (1999). Let Wi{v) denote the entry in a PSWM that corresponds to base 
V at column i and let m be the number of columns in the PSWM. For any word (of length 
m), a score 

m 



[Vi 



i=l 



is computed and words with high scores are of interest. We let V be the set of all with 
score not less than a pre-specified threshold level t. In other words, 



(4.2) 



V = {v^ : S{Vm) > t} 



is a motif for the PSWM associated with a given transcription factor. The matrix is derived 
from the frequencies of the four bases at various positions of known instances of the TFBS, 
which are usually confirmed by biological experiments. Huang et al. (2004) gave a good 
review of the construction of PSWMs. 

In principle, we can construct a word bank for V by simply generating words of length 
m from the underlying Markov chain and discarding words that do not belong to the motif. 
However for t large, such a procedure involves discarding a large proportion of the generated 
words. It is more efficient to generate the words with a bias towards larger scores. In 
Appendix B, we show how, for any given ^ > 0, a tilted Markov chain can be constructed to 
generate words v with probability mass function 

s^^W7r(i;i)cT(v)/A(e), 



(4.3) 



Kv) 



where A(^^) is a computable normalizing constant. If words with scores less than t are dis- 
carded, then the probability mass function of non-discarded words is 



(4.4) 



g(v) = Ce^^W7r(i;i)(7(v)/A(0) for v G V, 
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where ^ is an unknown normalizing constant that can be estimated by the reciprocal of the 
fraction of non-discarded words. There are two conflicting demands placed on the choice of 
6. As 9 increases, the expected score of words generated under qe{v) increases. We would 
thus like 9 to be large so that the fraction of discarded words is small. However at the same 
time, we would also like 9 to be small, so that the variation of /?(v) = q{v)/a{v) over v G V 
is small. Since 



(4.5) 



d 
d9 



'log MO)], 



we suggest choosing the root of the equation ^[logA(0)] = t. See Appendix B for more 
details on the the computation of A(9) and the numerical search of the root. 

4-1 Example 1 

We illustrate here the need for alternatives to analytical p-value approximations by 
applying Algorithm A on some special word patterns. Let P-,^ denotes probability with vi 
following stationary distribution tt. Huang et al. (2004) suggested an approximation, which 
for c = 1 reduces to 



(4.6) 



P{iV > 1} = 1 - (1 - PAS{^m) > t}) 



n— m+1 



Consider si, . . . , s„ independent and identically distributed random variables taking val- 
ues a, c, g and t with equal probabilities. Let 

/llllllllllll\ 
000000000000 
000000000000 
000000000000 



(4.7) 



rep 



V 



(4.8) 



w, 



norep 



/l00000011000^ 
010000100100 
001001000010 
000110000001 



and consider counting of words with score at least t for t = 9, 10 and 11. The approximation 
()4.6p is the same for both (j4.7p and (j4.8p but we know that the p-value when the PSWM is 
(|4.7p should be smaller due to the tendency of the word patterns to clump together. Of course, 
declumping corrections can be applied to this special case but this is not so straightforward 
for general PSWMs. Table 1 compares the analytical, direct Monte Carlo and importance 
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sampling approximations of P{N > 1} for (|4.7p and (j4.8p with n = 200. The simulations 
reveal substantial over-estimation of p- values for Wrcp when using ()4.6p . Algorithm A is 
able to maintain its accuracy over the range of t considered whereas direct Monte Carlo has 
acceptable accuracy only for t = 9. 

4-2 Example 2 

We implement Algorithm B here with 

(4.9) P. = min {l, ( . -V}, 

I \n — t — [c — i\i){m — 1) / ) 

where x"*" = max{0, x}. We choose pi in this manner to encourage word insertion when there 
are few bases left to be generated and the desired number of word patterns has not yet been 
observed. The motif consists of all words of length 12 having score at least 50 with respect 
to the PSWM (j4.ip . The transition matrix for generating the DNA sequence is 



(4.10) 



a 


^.3577 


.1752 


.1853 


.2818^ 


c 


.3256 


.2056 


.1590 


.3096 


9 


.2992 


.2180 


.2039 


.2789 


t 


(^.2381 


.1943 


.1905 


.3771^ 



and the length of the sequence investigated is n = 700. We see from Table 2 variance 
reduction of 10-100 times in the simulation of probabilities of order 10^^ to 10^^. For smaller 
probabilities, direct Monte Carlo does not provide an estimate whereas estimates from the 
importance sampling algorithm retain their accuracy. Although importance sampling takes 
about two times the computing time of direct Monte Carlo for each simulation run, the 
savings in computing time to achieve the same level of accuracy are quite substantial. 
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5. CO-OCCURRENCES OF MOTIFS 



For a more detailed sequence analysis of promoter regions, one can search for cis-regulatory 
modules (CRM) instead of single motifs. We define CRM to be a collection of fixed length 
motifs that are located in a fixed order in proximity to each other. They are signals for co- 
operative binding of transcription factors, and are important in the study of combinatorial 
regulation of genes. CRMs have been used successfully to gain a deeper understanding of 
gene regulation, cf. Chiang et al. (2003), Zhou and Wong (2004) and Zhang et al. (2007). We 
focus here on the simplest type of CRM: A co-occurring pair of high scoring words separated 
by a gap sequence of variable length. Let Si{-) be the score of a word of length m calculated 
with respect to a PSWM Wi, and 5*2 (•) the score of a word of length r calculated with respect 
to a PSWM W2- Let < di < ^2 < oo be the prescribed limits of the length of the gap and 
ti, t2 threshold levels for Wi and W2 respectively. The family of words for the co-occurring 
motifs is 



In Section 4, we showed how word banks for the motifs Vi := {v^ : Si{Vrn) > ti} and 
V2 := {Ur : S2{vLr) > ^2} are created. Let qi be the probability mass function for Vj. A word 
bank for V can then be created by repeating the following steps. 

1. Select and independently from their respective word banks. 

2. Select d randomly from {di, . . . , ^2}. Generate zi, . . . ,Z(i sequentially from the under- 
lying Markov chain, initialized at zq = Vm- 

3. Store w = v^z^Ur into the word bank. 

Let q be the probability mass function of the stored words. Then 



and hence /?(w) = g(w)/a(w) = (d2 - di + Pi{vm)f32iur)/cr{zdUi). 
5.1 Example 3 

The transcription factors SFF (with PSWM Wi) and MCMl (with PSWM W2) are 
regulators of the cell cycle in yeast, and are known to co-operate at close distance in the 
promoter regions of the genes they regulate, see Spellman et al. (1998). Their PSWMs can 



(5.1) 



V = {v^ZUr : S'l(v^) > ti, S'2(Ur) > t2, ^1 < ^(z) < C?2}- 



(5.2) 



g(w) = (d2 -di + l) '^qi{vm)cr{vm2.d)q2{ur) 
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be obtained from the database SCPD. Define V by (|5.ip with ti = 48, t2 = 110, di = and 
d2 = 100. We would hke to estimate the probabihty that the motif V appears at least once 
within a promoter sequence of length n = 700. The estimated probability using Algorithm 
A is 3.4 X 10~^ with a standard error of 3 x 10~^. The corresponding standard error for 
1000 direct Monte Carlo runs would have been about 2 x 10~^, which is large relative to the 
underlying probability. 

5.2 Structured Motifs 

These co-occurring motifs considered in Robin et al. (2002) consist essentially of fixed 
word patterns and separated by a gap of length d, with an allowance for the mutation 
of up to one base in ^my^v Th.6 motif c3-n be expressed 3-s 

(5.3) V = {v^zur : di < i{z) < d2, \{i ■ Vi / Xi}\ + \{i : Ui / yi}\ < 1}. 

We create a word for the word bank of V in the following manner. 

1. Select k randomly from {0, . . . ,m + r}. If k = 0, then there is no mutation and we let 
v^Ur = XmYr- Otherwise, change the A;th base of XmYr equally likely into one of the 
three other bases and denote the mutated sequence as v^u^. 

2. Select d randomly from {di, . . . , ^2} and generate the bases of z = zi ■ ■ ■ sequentially 
from the underlying Markov chain, initialized at zq = Vm- 

We perform a simulation study on eight structural motifs selected for their high frequency 
of occurrences in part of the Bacillus suhtilis DNA dataset. We consider (^1,^2) = (16,18) 
and (5, 50), with length of DNA sequence n = 100, and a Markov chain with transition matrix 



0.35 


0.16 


0.18 


0.31 


0.33 


0.20 


0.15 


0.32 


0.32 


0.22 


0.19 


0.27 


0.25 


0.20 


0.19 


0.35 



In Table 3, we compare importance sampling estimates of P{N > 1} using Algorithm A 
with analytical p-value estimates from Robin et al. (2002) and direct Monte Carlo p-value 
estimates. The analytical p-value estimates are computed numerically via recursive methods 
with computation time that grows exponentially with d2 — di, and are displayed only for the 
case {di,d2) = (16,18). 

We illustrate here how the importance sampling algorithms can be modified to handle 
more complex situations, for example, to obtain a combined p-value for all eight motifs. 
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Consider more generally p = P{maxi<j<j(A^(-') — cj) > 0}, where A^^-^^ is the total word 
count from the motif V^-'^ and cj is a positive integer. Let L^^^ be the likelihood ratio when 
applying either Algorithm A or B with insertion of words from V'--'-'. For the kth simulation 
run, we execute the following steps. 

1. Select jk randomly from {1, . . . , J}. 

2. Generate Sq^^ using either Algorithm A or B, with insertion of words from V*--^^. 
Then 

K 

is unbiased for p, see Appendix C. The key feature in (|5.4p is the correction term \{j : 
ArO)(s[)''^) > Cj}\. Without this term, Pj is an unbiased estimator for the Bonferroni upper 
bound X]j=i P{N^^^ > cj}. The correction term adjusts the estimator downwards when more 
than one thresholds cj are exceeded. 

We see from Table 3 that the variance reduction is substantial when importance sam- 
pling is used. In fact, the direct Monte Carlo estimate is often unreliable. Such savings in 
computation time is valuable both to the end user and also to the researcher trying to test 
the reliability of his or her analytical estimates on small p- values. We observe for example 
that the numerical estimates for (^1,^2) = (16,18) given in Robin et al. (2002) are quite 
accurate but tends to underestimate the true underlying probability. 
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6. DISCUSSION 



The examples given here are not meant to be exhaustive but they do indicate how we can 
proceed in situations not covered here. For example, if wc would like the order of the two 
words in a CRM to be arbitrary, we can include an additional permutation step in the 
construction of the word bank. In Section 5.2, we also showed how to simulate p- values of 
the maximum count over a set of word patterns. As we gain biological understanding, the 
models that we formulate for DNA and protein functional sites become more complex. Over 
the years, they have evolved from deterministic words to consensus sequences to PSWMs and 
then to motif modules. As probabilistic models for promoter architecture gets more complex 
and context specific, importance sampling methods are likely to be more widely adopted in 
the computation of p-values. 
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APPENDIX A 



We first show how words can be generated with probabihty mass function 

g(vm) = 7r(t;i)cT(vm)c7(v^)/r/, 

with 7] = Ylvni '^{''^i)^i^m)(^i^m) ^ Computable normahzing constant. Apply the backward 
recursive relations 

(A.l) r]i{x) = a{xy)(T{y^x'^)r]ij^i{y) for all x G and z = 1, . . . , m — 1, 

initialized with r]m{x) = 1 for all x. Then t] = Ylxex Q desired 
probability measure for generating with probability mass function q. Then the Markovian 
property 

Q{vi=x} = TT{x)r]i{x)/T], 
(A.2) Q{vi_^_i = y\vi = x} = a{xy)a{y''x'')r]i+i{y)/rii{x) ioi i= I,..., m-1, 

allows us to generate Vi sequentially via transition matrices. 

To generate words v^, with probability mass function q{vm) = '^(^i)c(vmV^)/r7, let 
ilmix) = a{xx'^) instead of rjm{x) = 1 and proceed with (|A.ip and ()A.2p . 

APPENDIX B 

Let S be the score with respect to a given PSWM W and let > 0. We provide here a 
quick recursive algorithm for generating v^, from the probability mass function 

(A.3) qe{^m) = e''^(^™)7r(z;i)a(v^)/A(0), 

with A(0) = 'Ylvm^^^^^'^^'^i'^'^)'^i^rn) a computable normalizing constant. Since logA(0) is 
convex, the solution of ^[logA(0)] = t can be found using a bijection search. We take note 
of the backward recursive relations 

A„(^,x) = e^"""(^), 
(A.4) K{0,x) = e^"''^^) ^a(xy)Ai+i(0,2/) for all X e A" and i = 1,... ,m- 1, 

y&X 

from which we can compute A(^) = "Y^xex '^{x)Ai{9, x). Let Q denote the desired probability 
measure for generating = vi ■ ■ ■ Vm from qg. By ()A.3P and ()A.4p . we can simply generate 
the letters Vi sequentially, using transition matrices defined by the Markovian relations 

Q{vi=x} = 7r(x)Ai(e,x)/A(0), 
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(A.5) Q{vi+i = v\v, = x} = e^"''("V(xy)Ai+i(e,y)/Ai(0,x) fori = l,...,m-l. 



APPENDIX C 



We shall show here that pi in (j5.4p is unbiased for p = P{msxi<:j<:j{N^^^ — Cj) > 0}. 
Let Aj = {sq : A^*--'^(so) > cj} and let Qj be a probabihty measure such that ^'^■'-'(so) = 
Qj{sQ)/ P{sq) > for any Sq G Aj. Let A = U^L^^ylj. Then with the convention 0/0 = 0, 

j-|:^,,{[LO)N)i-(^^-^)i,.„,,,,} ^ ^( g--;'-!;;] ) - . 

and hence is indeed unbiased. 
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APPENDIX D: ASYMPTOTIC OPTIMALITY 



To estimate p := P{N{s) > c} using direct Monte Carlo, simply generate K independent 
copies of s, denoted by s^^^ . . . ,s^^\ under the original probability measure P, and let 

K 

PD = ^ l{Ar(s(fc))>c}- 

k=l 

To simulate p using importance sampling, we need to first select a probability measure Q P 
for generating s^^^ . . . , s*^^). The estimate of p is then 

K 

n := ^"'(s''^)l{iV(sW)>c}> where L(s) = Q{s)/P{s). 

k=l 

We require Q{s) > whenever N{s) > c, so as to ensure that pi is unbiased for p. 

The relative error (RE) of a Monte Carlo estimator p = pj:, or pi, is given by y^Var(|>) /p. 
We say that p is asymptotically optimal if for any e > 0, we can satisfy RE < e with 
logic = o(| logpl) as p ^ 0, see Sadowsky and Bucklew (1990) and Dupuis and Wang (2005). 
Since RE(pd) = -^/(l — p)/{Kp), direct Monte Carlo is not asymptotically optimal. The 
question we would like to answer here is: Under what conditions are Algorithms A and B 
asymptotically optimal? 

The examples described in Sections 3-5 involve word families that can be characterized 
as Vm- We may also include an additional subscript m to a previously defined quantity to 
highlight its dependence on m, for example Pm, qm, Pm and rim- We say that Xm and ym have 
similar logarithmic value relative to m, and write 2± yjj^, if llogx^ - logy^l = o(m) as 
m — ^ CO. It is not hard to see that if Xm — Vm and y^n — z^, then x^ — Zm- In Algorithm 
A, it is assumed implicitly that rim ^ ^max(= ^max,m) ■= niaxvgVm ^(v) and we shall also 
assume rim > c£max when using Algorithm B. To fix the situation, let pi = c/rim for all i 
in Algorithm B. Let /3min(= Pmin,m) = miuvev^ /^m ( v) , /3max(= Pmax,m) = niaxvgv^/3m(v), 

0-min = lOain^^y^x Cr{xy){> 0), (Jmax = m8iXx,yeX (Tixy){< 1) and TTmin = mina;g;t'7r(x)(> (Tmin)- 

Let [-J denote the greatest integer function, P^ denote probability conditioned on si = a; or 
vi = X and denote probability conditioned on si or vi following the stationary distribution. 

In the following lemma, we provide conditions for asymptotic optimality and check them 
in Appendices D.1-D.3 for the word families discussed in Sections 3-5. 
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Lemma 1. If log Um — 1 and 

(A. 6) pm < a™ for some < a < 1, 

(A. 7) ^max — 1) 

(A.8) ^ ( ct(v))"\ 

veVm 

^/len both Algorithms A and B are asymptotically optimal. 

Proof Let = Yjx^x P^i^i G for some £ > 1}. Since ZlveV^ ^(v) > ^'m > 
Evev™ ^(v), by (IMl) and jAl), 

(A.9) - E ^(^) =^ /^min- 

veVm 

By (6.1), I logprrti ^ ""T-l logo I for all large m and hence it suffices for us to show Km ~ 1- 
If rim — 1, then by (jA.OP and the inequalities ^ Pm > (fmin^'m)'^; 

(A.IO) (^m/'min)'' - {nmrniY ^ Pm- 

Consider next the case rim/-^max ^ oo. Since logrim — 1, there exists integers such that 

im - 1, Cm = o(nm,) and lognm = 0{im)- Let Km = L"-m/(^max + Cm)J and Qm = Pivist G Vm 

for some I > 1}. By ()A.6p . a™ > > (s'mCmin)'^ and hence Qm — > 0. Since the underlying 
Markov chain is uniformly ergodic, 

(A. 11) sup \Px{sk+i = y} — '^iy)\ ^ some < r/ < 1. 

By considering the sub-cases of at least c words v G Vm starting at positions 1, (^max + Cm) + 
1, . . . , {Km - l)(^max + Cm) + 1, it follows from (lA.lip that 



Pm > 1 - E 5^(1 - <7m)'^--^' - {Km " l)r/^™ = 1 - (1 + o(l)) ^ ^^^e^-^™ - o(l). 



By ()A.6p . Kmgm and this implies Hmrm 0. Since (^max + Cm) — 1, it follows that 
Km — rim and hence by the inequalities 



I m — — \ I y-'vcnn' m) \^ ' m) > 



(jA.lOp again holds. By using a subsequence argument if necessary, it follows that (jA.lOp 
holds as long as logn^ ~ 1. 
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For Algorithm A, by ([O]) and ([TI]) . 

RE(pi) < p~^i^^^/2supL"^(s)l{7v(s)>l} < Pm^^m^^^?^mO-max/?min 
s 

and the desired relation Km ~ 1 fohows from (jA.lOp with c = 1. 

For Algorithm B, it follows from (|2.6p that if A^(s) > c, then L(s) > (1 - c/nm)"™ 
x[c/3min/(nmfTmax)]'' and hence by ([231), 

RE(pi) <p„ii^^i/2supL-i(s)l{7v(s)>c} < {l + o{l))Pm^K;;^^/^[enmam^Jic(3minW, 

s 

and again Km — 1 follows from (|A.10p . □ 
D.l Inverted repeats 

Consider the word family (j3.ip with d2 ~ 1- Then ()A.7p holds. Since < (^2 — 
dijUmfy"^^^ 1 (|A.6p holds when = 0(7™) for some 7 < cy^^^- It remains to check (jA.Sp . 
Since Evgv„ 9rn(v) = EvgV„ ^m(v)fT(v) = 1, 

(A.12) < ( ^ a(v)) ^ < /3„,ax. 

v6Vm 

Let Mm be generated with probability proportional to 7r(tii)(T(um)(T(u5^) when creating the 
word bank Vm- Then there exists a constant C > such that 

/3m(UmZU;^) = C7r(ui)a(UmZ)cr(u;^)/cr(UmZU;^) = C7r(ui)/c7(ZdUi). 

Hence /3niin — /5max and (jA.SP follows form ()A.12p . 

D.2 Word patterns derived from PSWMs 

For the word family (14. 2p . condition ()A.7[) is always satisfied. Let the entries of the 
PSWM be non-negative integers and assume that the column totals are fixed at some C > 0. 
It follows from large deviations theory, see for example Dembo and Zeitouni (1998), that if 
t{= tm) > Et^S{v) + Qm for some C > 0, then 

(A.13) P^{5(v) >t} = 0(A") for some < A < 1. 

Since pm < ?^m^'7r{5'(v) > t}, ()A.6P holds if = 0(7™) for some 7 < A^^. 

To simplify the analysis in checking (jA.Sp . select the tilting parameter 6{= 9m) to be 
the root of ^^^^[^(v)] = t + 6m for some positive 5m = o(m) satisfying m^'^/'^Sm — > cxo as 
m 00, instead of the root of Eqg[S{\')] = t, as suggested in the statement containing ()4.5p . 
The implicit assumption is that ^™ ^^jmax^g^ t(;i(t')} > t + 5m for all m. Since the entries 
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of the transition matrices derived in Appendix B are uniformly bounded away from zero, it 
follows from a coupling argument that CoVqg{wi{vi),Wj{vj)) = 0(rl*^''l) for some < r < 1 
and hence Varqg(5(v)) = 0{m). By (|4.3p and Chebyshev's inequality, 

(A.14) e^(*+2^-) ^ a(v) / A{e) > J] qeM > 1 - CVar,(5(v)) > 

veVm V;|5(v)-t-<5m|<<5„ 

for all large m. Since ^ > 1 in (|4.4|) . /J^in = minvGV,„ 'Zm(v)/cr(v) > e^*7rmin/A(0) and (jA.Sp 
follows from ([02|) and (lAlll . 

Co-occurrences of motifs 

Consider the word family (jS.ip with (r/m) bounded away from zero and infinity and 
d2 ^ 1. We check that ^A7i]\ holds. If h > ESi{v) + (m for some C > 0, then ([Al3l) 
holds with S replaced by 5i, t replaced by ti and hence ()A.6P holds if rim = 0(7"*) for some 
7 < A-i. 

Let 6j be the root of Eg. [Sj{v)] = tj+6m for some positive 5m = o{m) with m^^'^dm 00, 
J = 1 and 2, assuming that Yl^ii^^^v&x wl''\v)} > tj + 6m, where mi = m and m2 = r. 
Let V^^^ = {v^ : 5i(v„) > ti], V?'^ = {u, : 52(u,) > ta} and let A(i)(6'i), A(2)(^2) be their 
respective normalizing constants, see ()4.3p . By the arguments in (|A.14p . 

E^(v)>^min( 5: ^(V))( a(u))=e-^^*-^^*^+°(-)A«(^l)A(2)(^2). 

By ([52]), /3mm > e^i*i+^2t2^-i^2^.^/{A(i)(0i)A(2)(02)} and hence (jXH]) follows from ([02]) . 
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t 


9 


10 


11 


Analytical 


7.1x10" 


-2 


7.1x10-3 


4.2x10" 


-4 


Wrep: Direct MC 


(3.6 ± .6) X 10" 


-2 


(5 ± 2) X 10-3 







Algorithm A 


(3.0 ±.1) X 10- 


-2 


(4.0 ± .2) X 10-3 


(2.7 ± .1) X 10" 


-4 


W^norep: Direct MC 


(6.7 ±.8) X 10" 


-2 


(9 ± 3) X 10-3 


(1±1) X 10" 


-3 


Algorithm A 


(7.5 ±.2) X 10- 


-2 


(6.9 ± .2) X 10-3 


(4.1 ± .1) X 10" 


-4 



Table 1: Comparisons of analytical, direct Monte Carlo and importance sampling approx- 
imations for P{N > 1} with n = 200 in Example 1. Each Monte Carlo entry is obtained 
using 1000 simulation runs and are expressed in the form p± standard error. 



c 


Direct MC 


Algorithm B 


1 


(9.6±.9) X 10-2 


(9.1±.3) X 10-2 


2 


(3±2) X 10-3 


(4.2±.2) X 10-3 


3 





(1.3 ± .1) X 10-^ 


4 





(2.6 ± .3) X 10-6 



Table 2: pistandard error for Example 2 with 1000 copies of Sq generated for both direct 
Monte Carlo and importance sampling using Algorithm B. 
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di 


d2 


X 


y 


Direct MC 


Algorithm A 


Analytic 


16 


18 


gttgaca 


atataat 


(2±1) 


X 


10" 


-4 


(1.038 ± 0.006) 


X 


10-^ 


1.01 X 


10" 


-4 






gttgaca 


tataata 











(9.00 ± 0.05) 


X 


10-^ 


8.82 X 


10- 


-5 






tgttgac 


tataata 


(20 ± 10) 


X 


10" 


-5 


(9.39 ± 0.05) 


X 


10-^ 


9.20 X 


10- 


-5 






ttgaca 


ttataat 


(9 ±3) 


X 


10- 


-4 


(6.65 ± 0.03) 


X 


10-4 


6.55 X 


10- 


-4 






ttgacaa 


tacaat 


(4 ±2) 


X 


10' 


-4 


(4.64 ± 0.02) 


X 


10-4 


4.57 X 


10' 


-4 






ttgacaa 


tataata 


(2± 1) 


X 


10- 


-4 


(1.798 ± 0.009) 


X 


10-4 


1.78 X 


10- 


-4 






ttgacag 


tataat 


(5 ±2) 


X 


10" 


-4 


(3.62 ± 0.02) 


X 


10-4 


3.59 X 


10- 


-4 






ttgacg 


tataat 


(10 X 3) 


X 


10- 


-4 


(9.90 ± 0.06) 


X 


10-4 


9.76 X 


10- 


-4 






combined 


p- value 


(2.0 ± 0.4) 


X 


10" 


-3 


(2.96 ± 0.03) 


X 


10-3 








5 


50 


gttgaca 


atataat 


(1±0.3) 


X 


10" 


-3 


(1.265 ± 0.008) 


X 


10-3 












gttgaca 


tataata 


(0.4 ±0.2) 


X 


10- 


-3 


(1.103 ± 0.007) 


X 


10-3 












tgttgac 


tataata 


(1.8 ±0.4) 


X 


10- 


-3 


(1.150 ± 0.007) 


X 


10-3 












ttgaca 


ttataat 


(7.4 ± 0.9) 


X 


10- 


-3 


(7.88 ± 0.05) 


X 


10-3 












ttgacaa 


tacaat 


(5.0 ± 0.7) 


X 


10- 


-3 


(5.50 ± 0.04) 


X 


10-3 












ttgacaa 


tataata 


(1.5 ±0.4) 


X 


10- 


-3 


(2.21 ±0.01) 


X 


10-3 












ttgacag 


tataat 


(3.1 ±0.6) 


X 


10" 


-3 


(4.23 ± 0.03) 


X 


10-3 












ttgacg 


tataat 


(0.9 ±0.1) 


X 


10" 


-2 


(1.126 ±0.008) 


X 


10-2 












combined 


p- value 


(2.7 ±0.2) 


X 


10- 


-2 


(3.30 ± 0.04) 


X 


10-2 









Table 3: Comparison of direct Monte Carlo, importance sampling and analytical estimates 
of P{N > 1} for structured motifs. For both direct Monte Carlo and importance sampling, 
10,000 simulation runs are executed for each entry and the results are displayed in the form 
p±standard error. 



23 



